We provide a VCF to PED tool to convert from VCF to PLINK PED format. This tool has documentation for both the web interface and the Perl script.
An example Perl command to run the script would be:
perl vcf_to_ped_converter.pl -vcf ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr13.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf.gz \
-sample_panel_file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/phase1_integrated_calls.20101123.ALL.sample_panel \
-region 13:32889611-32973805 -population GBR -population FIN
Our filename conventions depend on the data format being named. This issue is described in more detail in the three questions below.
Bas files contain statistics that we generate for our alignment files and distribute alongside them.
They hold read group level statistics in a tab delimited format and are described in this README.
Each mapped and unmapped bam file has an associated bas file, and we also provide them collected together into a single file in the alignment_indices directory, dated to match the alignment release.
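Because bas files are tab delimited, they can be read with any standard delimited-text parser. The following is a minimal sketch in Python, assuming the first line of the file is a header row. The column names and values in the sample text are invented for illustration; the real columns are those listed in the README.

```python
import csv
import io


def read_bas(handle):
    """Return one dict per read group from a tab-delimited bas file.

    Assumes the first line is a header row naming the columns.
    """
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical two-column excerpt; a real bas file has more columns.
sample_text = "readgroup\ttotal_reads\nRG0001\t1000\nRG0002\t2500\n"
rows = read_bas(io.StringIO(sample_text))

# Sum one numeric column across all read groups.
total = sum(int(row["total_reads"]) for row in rows)
```

For a file on disk, the same function could be called with an open text handle instead of the in-memory sample.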
Our sequence files are distributed in gzipped fastq format.
Our files are named with the SRA run accession E?SRR000000.filt.fastq.gz. All the reads in the file also hold this name. The files with _1 and _2 in their names are associated with paired end sequencing runs. If there is also a file with no number in its name, it holds the fragments where the other end failed QC. The .filt in the name indicates that the data in the file has been filtered after retrieval from the archive. This filtering process is described in a README.
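The naming rule above can be checked mechanically. The sketch below is one possible way to do so in Python; the regular expression is an assumption (run accessions starting with SRR or ERR followed by digits), and `describe_fastq` is a hypothetical helper name, not part of any project tool.

```python
import re

# Assumed pattern: run accession, optional _1/_2 mate suffix, then .filt.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<run>[SE]RR\d+)(?:_(?P<mate>[12]))?\.filt\.fastq\.gz$"
)


def describe_fastq(filename):
    """Return (run accession, kind) for a filtered fastq file name."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a recognised fastq name: {filename}")
    mate = match.group("mate")
    # No mate number means the file holds reads without a surviving partner.
    kind = f"mate {mate}" if mate else "unpaired"
    return match.group("run"), kind
```

For example, `describe_fastq("SRR000000_1.filt.fastq.gz")` would report the run `SRR000000` and the kind `mate 1`.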
Our variant files are distributed in vcf format, a format initially designed for the 1000 Genomes Project which has seen wider community adoption.
The majority of our vcf files are named in the form:
**<span style="color:red">ALL</span>.<span style="color:blue">chrN</span> | <span style="color:green">wgs | wex</span>.<span style="color:orange">2of4intersection</span>.<span style="color:violet">20100804</span>.<span style="color:darkblue">snps | indels | sv</span>.genotypes.<span style="color:darkred">analysis_group</span>.vcf.gz**
The name starts with the <span style="color:red">population</span> in which the variants were discovered; ALL means that all the individuals available at that date were used. Next comes the <span style="color:blue">region</span> covered by the call set, which can be a chromosome, <span style="color:green">wgs</span> (the file contains at least all the autosomes) or <span style="color:green">wex</span> (the whole exome). This is followed by a <span style="color:orange">description</span> of how the call set was produced or who produced it, and a <span style="color:violet">date</span> that matches the sequence and alignment freezes used to generate the variant call set. Then comes a field describing the <span style="color:darkblue">type of variant</span> the file contains and the <span style="color:darkred">analysis group</span> used to generate the variant calls, which should be low coverage, exome or integrated. The name also states whether the file holds sites or genotypes: a sites file contains just the first 8 columns of the vcf format, while a genotypes file contains individual genotype data as well.
Release directories should also contain panel files, which describe which individuals the variants have genotypes for and which populations those individuals are from.
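As an illustration of this convention, the sketch below splits a vcf file name into its fields in Python. It rests on several assumptions: no field contains a dot, the first five fields appear in the order described, and the analysis group may be absent (as in the file name used in the Perl example). `parse_vcf_name` is a hypothetical helper, and the second file name in the usage lines is made up to show an analysis group.

```python
CONTENT_TYPES = ("sites", "genotypes")


def parse_vcf_name(filename):
    """Split a vcf file name into its dot-separated naming fields."""
    suffix = ".vcf.gz"
    if not filename.endswith(suffix):
        raise ValueError(f"expected a {suffix} file: {filename}")
    parts = filename[: -len(suffix)].split(".")
    if len(parts) < 6:
        raise ValueError(f"too few fields in: {filename}")
    population, region, description, date, variant_type = parts[:5]
    rest = parts[5:]
    # The remaining fields hold sites/genotypes and, optionally, the analysis group.
    content = next((p for p in rest if p in CONTENT_TYPES), None)
    analysis_group = next((p for p in rest if p not in CONTENT_TYPES), None)
    return {
        "population": population,
        "region": region,
        "description": description,
        "date": date,
        "variant_type": variant_type,
        "content": content,
        "analysis_group": analysis_group,
    }


released = parse_vcf_name(
    "ALL.chr13.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf.gz"
)
# Made-up name, used only to show a file that carries an analysis group.
made_up = parse_vcf_name(
    "ALL.wgs.2of4intersection.20100804.snps.genotypes.low_coverage.vcf.gz"
)
```

Names that deviate from the majority pattern would need their own handling, so a parser like this is best treated as a convenience rather than a validator.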
All our alignment files are in BAM format, a standard alignment format which was defined by the consortium and has since seen wide community adoption. We also provide our alignments in CRAM format.
The bam file names look like:
<span style="color:red">NA00000</span>.<span style="color:blue">location</span>.<span style="color:green">platform</span>.<span style="color:orange">population</span>.<span style="color:violet">analysis_group</span>.<span style="color:darkred">YYYYMMDD</span>.bam
The bai index and bas statistics files are also named in the same way.
The name includes the <span style="color:red">individual sample ID</span> and <span style="color:blue">where the sequence is mapped to</span>: if the file only contains mappings to a particular chromosome, the name contains that chromosome; otherwise, mapped means the whole genome mapping and unmapped means the reads which failed to map to the reference (pairs where one mate mapped and the other didn’t stay in the mapped file). It then gives <span style="color:green">the sequencing platform</span>, <span style="color:orange">the ethnicity of the sample</span> using our three letter population code, and <span style="color:violet">the sequencing strategy</span>. The <span style="color:darkred">date</span> matches the date of the sequence used to build the bams and can also be found in the sequence.index filename.
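The bam naming pattern can be unpacked the same way. The sketch below is a minimal Python example that assumes a name carries exactly the six fields shown in the pattern above, separated by dots, followed by `bam`. The helper name `parse_bam_name` and the field values in the example name (platform, population, analysis group, date) are placeholders chosen for illustration.

```python
from datetime import datetime

# Field order taken from the naming pattern described in the text.
BAM_FIELDS = (
    "sample", "location", "platform", "population", "analysis_group", "date",
)


def parse_bam_name(filename):
    """Split a bam file name into its fields and parse the YYYYMMDD date."""
    parts = filename.split(".")
    if len(parts) != len(BAM_FIELDS) + 1 or parts[-1] != "bam":
        raise ValueError(f"unexpected bam name: {filename}")
    info = dict(zip(BAM_FIELDS, parts[:-1]))
    info["date"] = datetime.strptime(info["date"], "%Y%m%d").date()
    return info


# Placeholder name built from the pattern; not a real file.
info = parse_bam_name("NA00000.mapped.ILLUMINA.GBR.low_coverage.20110521.bam")
```

A name with more or fewer fields than the pattern raises `ValueError`, which makes deviations from the convention visible instead of silently mislabelling fields.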
We distribute the fastq files for our paired end sequencing in 2 files: mate1 is found in the file labelled _1 and mate2 in the file labelled _2. The files which do not have a number in their name contain single ended reads. This can be for two reasons: some sequencing early in the project was single ended, and, because we filter our fastq files as described in our README, if one of a pair of reads gets rejected the other read gets placed in the single file.
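When working with a directory of fastq files, it can help to group the files belonging to one run before processing them. The Python sketch below shows one way to do this; the `group_by_run` helper, the `mate1`/`mate2`/`single` labels and the file names in the example are all hypothetical, and the pattern assumes the run name contains no dots or underscores.

```python
import re
from collections import defaultdict

# Assumed pattern: run name, optional _1/_2 suffix, then .filt.fastq.gz
RUN_FILE = re.compile(r"^(?P<run>[^_.]+)(?:_(?P<mate>[12]))?\.filt\.fastq\.gz$")
SLOTS = {"1": "mate1", "2": "mate2", None: "single"}


def group_by_run(filenames):
    """Map each run to its mate1, mate2 and single fastq files."""
    runs = defaultdict(dict)
    for name in filenames:
        match = RUN_FILE.match(name)
        if match is None:
            continue  # skip anything that is not a filtered fastq file
        runs[match.group("run")][SLOTS[match.group("mate")]] = name
    return dict(runs)


files = [
    "SRR000000_1.filt.fastq.gz",
    "SRR000000_2.filt.fastq.gz",
    "SRR000000.filt.fastq.gz",
    "SRR000001.filt.fastq.gz",
    "notes.txt",
]
grouped = group_by_run(files)
# A run is paired when both mate files are present.
paired_runs = [run for run, slots in grouped.items() if {"mate1", "mate2"} <= set(slots)]
```

In this example the first run has both mate files plus a single file of reads whose partner was rejected, while the second run has only a single file, as would be expected for single ended sequencing.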